Disk array system with controllers that prefetch and buffer ATA disk drive commands

ABSTRACT

A disk array system comprises a plurality of controllers, each of which preferably implements a host side of an ATA interface protocol within automated circuitry to control a respective ATA disk drive. Each controller includes a command buffer for storing disk drive commands to be executed by a respective ATA drive, and includes a circuit that prefetches such commands so that a next disk drive command will be available within the command buffer when the disk drive finishes executing a current disk drive command. A delay that commonly occurs when an ATA disk drive retrieves a next disk drive command is thereby reduced or avoided. The disk drive commands are preferably dispatched to the controllers by a microcontroller over a control bus that is separate from a bus used for input/output data transfers.

PRIORITY CLAIM

[0001] This application is a continuation of U.S. patent appl. Ser. No. 10/142,562, filed May 9, 2002, which is a continuation of U.S. patent appl. Ser. No. 09/558,524, filed Apr. 26, 2000 (now U.S. Pat. No. 6,421,760), which is a continuation of U.S. patent appl. Ser. No. 09/034,247, filed Mar. 4, 1998 (now U.S. Pat. No. 6,134,630), which claims the benefit of U.S. Provisional Appl. No. 60/065,848, filed Nov. 14, 1997.

FIELD OF THE INVENTION

[0002] The present invention relates to disk arrays, and more particularly, relates to hardware and software architectures for hardware-implemented RAID (Redundant Array of Inexpensive Disks) and other disk array systems.

BACKGROUND OF THE INVENTION

[0003] A RAID system is a computer data storage system in which data is spread or “striped” across multiple disk drives. In many implementations, the data is stored in conjunction with parity information such that any data lost as the result of a single disk drive failure can be automatically reconstructed.

[0004] One simple type of RAID implementation is known as “software RAID.” With software RAID, software (typically part of the operating system) which runs on the host computer is used to implement the various RAID control functions. These control functions include, for example, generating drive-specific read/write requests according to a striping algorithm, reconstructing lost data when drive failures occur, and generating and checking parity. Because these tasks occupy CPU bandwidth, and because the transfer of parity information occupies bandwidth on the system bus, software RAID frequently produces a degradation in performance relative to single disk drive systems.

[0005] Where performance is a concern, a “hardware-implemented RAID” system may be used. With hardware-implemented RAID, the RAID control functions are handled by a dedicated array controller (typically a card) which presents the array to the host computer as a single, composite disk drive. Because little or no host CPU bandwidth is used to perform the RAID control functions, and because no RAID parity traffic flows across the system bus, little or no degradation in performance occurs.

[0006] One potential benefit of RAID systems is that the input/output (“I/O”) data can be transferred to and from multiple disk drives in parallel. By exploiting this parallelism (particularly within a hardware-implemented RAID system), it is possible to achieve a higher degree of performance than is possible with a single disk drive. The two basic types of performance that can potentially be increased are the number of I/O requests processed per second (“transactional performance”) and the number of megabytes of I/O data transferred per second (“streaming performance”).

[0007] Unfortunately, few hardware-implemented RAID systems provide an appreciable increase in performance. In many cases, this failure to provide a performance improvement is the result of limitations in the array controller's bus architecture. Performance can also be adversely affected by frequent interrupts of the host computer's processor.

[0008] In addition, attempts to increase performance have often relied on the use of expensive hardware components. For example, some RAID array controllers rely on the use of a relatively expensive microcontroller that can process I/O data at a high transfer rate. Other designs rely on complex disk drive interfaces, and thus require the use of expensive disk drives.

[0009] The present invention addresses these and other limitations in existing RAID architectures.

SUMMARY OF THE INVENTION

[0010] One particular embodiment of the invention is a disk array controller that comprises a plurality of controllers, each of which preferably implements a host side of an ATA interface protocol within automated circuitry to control a respective ATA disk drive. The automated controllers are connected by a control bus to a microcontroller that dispatches disk drive commands to the automated controllers in response to I/O requests from a host computer. The automated controllers are also connected to a second bus used for transfers of input/output data.

[0011] In accordance with the invention, each controller preferably includes a command buffer for storing disk drive commands to be executed by its respective ATA drive, and includes an automated circuit for prefetching such disk drive commands so that a next disk drive command will be available within the command buffer when the respective disk drive finishes executing a current disk drive command. A delay that commonly occurs when an ATA disk drive retrieves a next disk drive command is thereby reduced or avoided.

[0012] Neither this summary nor the following detailed description is intended to define the invention. The invention is defined by the claims.

BRIEF DESCRIPTION OF THE DRAWINGS

[0013] These and other features of the architecture will now be described in further detail with reference to the drawings of the preferred embodiment, in which:

[0014] FIG. 1 illustrates a prior art disk array architecture.

[0015] FIG. 2 illustrates a disk array system in accordance with a preferred embodiment of the present invention.

[0016] FIG. 3 illustrates the general flow of information between the primary components of the FIG. 2 system.

[0017] FIG. 4 illustrates the types of information included within the controller commands.

[0018] FIG. 5 illustrates a format used for the transmission of packets.

[0019] FIG. 6 illustrates the architecture of the system in further detail.

[0020] FIG. 7 is a flow diagram which illustrates a round robin arbitration protocol which is used to control access to the packet-switched bus of FIG. 2.

[0021] FIG. 8 illustrates the completion logic circuit of FIG. 6 in further detail.

[0022] FIG. 9 illustrates the transfer/command control circuit of FIG. 6 in further detail.

[0023] FIG. 10 illustrates the operation of the command engine of FIG. 9.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

[0024] I. Existing RAID Architectures

[0025] To illustrate several of the motivations behind the present invention, a prevalent prior art architecture used within existing PC-based RAID systems will initially be described with reference to FIG. 1. As depicted in FIG. 1, the architecture includes an array controller card 30 (“array controller”) that couples an array of SCSI (Small Computer Systems Interface) disk drives 32 to a host computer (PC) 34. The array controller 30 plugs into a PCI (Peripheral Component Interconnect) expansion slot of the host computer 34, and communicates with a host processor 38 and a system memory 40 via a host PCI bus 42. For purposes of this description and the description of the preferred embodiment, it may be assumed that the host processor 38 is an Intel Pentium™ or other X86-compatible microprocessor, and that the host computer 34 is operating under either the Windows™ 95 or the Windows™ NT operating system.

[0026] The array controller 30 includes a PCI-to-PCI bridge 44 which couples the host PCI bus 42 to a local PCI bus 46 of the controller 30, and which acts as a bus master with respect to both busses 42, 46. Two or more SCSI controllers 50 (three shown in FIG. 1) are connected to the local PCI bus 46. Each SCSI controller 50 controls the operation of two or more SCSI disk drives 32 via a respective shared cable 52. The array controller 30 also includes a microcontroller 56 and a buffer 58, both of which are coupled to the local PCI bus by appropriate bridge devices (not shown). The buffer 58 will typically include appropriate exclusive-OR (XOR) logic 60 for performing the XOR operations associated with RAID storage protocols.

[0027] In operation, the host processor 38 (running under the control of a device driver) sends input/output (I/O) requests to the microcontroller 56 via the host PCI bus 42, the PCI-to-PCI bridge 44, and the local PCI bus 46. Each I/O request typically consists of a command descriptor block (CDB) and a scatter-gather list. The CDB is a SCSI drive command that specifies such parameters as the disk operation to be performed (e.g., read or write), a disk drive logical block address, and a transfer length. The scatter-gather list is an address list of one or more contiguous blocks of system memory for performing the I/O operation.

[0028] The microcontroller 56 runs a firmware program which translates these I/O requests into component, disk-specific SCSI commands based on a particular RAID configuration (such as RAID 4 or RAID 5), and dispatches these commands to corresponding SCSI controllers 50. For example, if, based on the particular RAID configuration implemented by the system, a given I/O request requires data to be read from every SCSI drive 32 of the array, the microcontroller 56 sends SCSI commands to each of the SCSI controllers 50. The SCSI controllers in turn arbitrate for control of the local PCI bus 46 to transfer I/O data between the SCSI disks 32 and system memory 40. I/O data that is being transferred from system memory 40 to the disk drives 32 is initially stored in the buffer 58. The buffer 58 is also typically used to perform XOR operations, rebuild operations (in response to disk failures), and other operations associated with the particular RAID configuration. The microcontroller 56 also monitors the processing of the dispatched SCSI commands, and interrupts the host processor 38 to notify the device driver of completed transfer operations.

[0029] The FIG. 1 architecture suffers from several deficiencies that are addressed by the present invention. One such deficiency is that the SCSI drives 32 are expensive in comparison to ATA (AT Attachment) drives. While it is possible to replace the SCSI drives with less expensive ATA drives (see, for example, U.S. Pat. No. 5,506,977), the use of ATA drives would generally result in a decrease in performance. One reason for the decreased performance is that ATA drives do not buffer multiple disk commands; thus each ATA drive would normally remain inactive while a new command is being retrieved from the microcontroller 56. One goal of the present invention is thus to provide an architecture in which ATA and other low-cost drives can be used while maintaining a high level of performance.

[0030] Another problem with the FIG. 1 architecture is that the local PCI bus and the shared cables 52 are susceptible to being dominated by a single disk drive 32. Such dominance can result in increased transactional latency, and a corresponding degradation in performance. A related problem is that the local PCI bus 46 is used both for the transfer of commands and the transfer of I/O data; increased command traffic on the bus 46 can therefore adversely affect the throughput and latency of data traffic. As described below, the architecture of the preferred embodiment overcomes these and other problems by using separate control and data busses, and by using a round-robin arbitration protocol to grant the local data bus to individual drives.

[0031] Another problem with the prior art architecture is that because the microcontroller 56 has to monitor the component I/O transfers that are performed as part of each I/O request, a high-performance microcontroller generally must be used. As described below, the architecture of the preferred embodiment avoids this problem by shifting the completion monitoring task to a separate, non-program-controlled device that handles the task of routing I/O data, and by embedding special completion data values within the I/O data stream to enable such monitoring. This effectively removes the microcontroller from the I/O data path, enabling the use of a lower cost, lower performance microcontroller.

[0032] Another problem, in at least some RAID implementations, is that the microcontroller 56 interrupts the host processor 38 multiple times during the processing of a single I/O request. For example, it is common for the microcontroller 56 to interrupt the host processor 38 at least once for each contiguous block of system memory referenced by the scatter-gather list. Because there is significant overhead associated with the processing of an interrupt, the processing of the interrupts significantly detracts from the processor bandwidth that is available for handling other types of tasks. It is therefore an object of the present invention to provide an architecture in which the array controller interrupts the host processor no more than once per I/O request.

[0033] A related problem, in many RAID architectures, is that when the array controller 30 generates an interrupt request to the host processor 38, the array controller suspends operation, or at least postpones generating the following interrupt request, until after the pending interrupt request has been serviced. This creates a potential bottleneck in the flow of I/O data, and increases the number of interrupt requests that need to be serviced by the host processor 38. It is therefore an object of the invention to provide an architecture in which the array controller continues to process subsequent I/O requests while an interrupt request is pending, so that the device driver can process multiple completed I/O requests when the host processor eventually services an interrupt request.

[0034] The present invention provides a high performance disk array architecture which addresses these and other problems with prior art RAID systems. An important aspect of the invention is that the primary performance benefits provided by the architecture are not tied to a particular type of disk drive interface. Thus, the architecture can be implemented using ATA drives (as in the preferred embodiment described below) and other types of relatively low-cost drives while providing a high level of performance.

[0035] II. System Overview

[0036] A disk array system which embodies the various features of the present invention will now be described with reference to the remaining drawings. Throughout this description, reference will be made to various implementation-specific details, including, for example, part numbers, industry standards, timing parameters, message formats, and widths of data paths. These details are provided in order to fully set forth a preferred embodiment of the invention, and not to limit the scope of the invention. The scope of the invention is set forth in the appended claims.

[0037] As depicted in FIG. 2, the disk array system comprises an array controller card 70 (“array controller”) that plugs into a PCI slot of the host computer 34. The array controller 70 links the host computer to an array of ATA disk drives 72 (numbered 1-N in FIG. 2), with each drive connected to the array controller by a respective ATA cable 76. In one implementation, the array controller 70 includes eight ATA ports to permit the connection of up to eight ATA drives. The use of a separate port per drive 72 enables the drives to be tightly controlled by the array controller 70, as is desirable for achieving a high level of performance. In the preferred embodiment, the array controller 70 supports both the ATA mode 4 standard (also known as Enhanced IDE) and the Ultra ATA standard (also known as Ultra DMA), permitting the use of both types of drives.

[0038] As described below, the ability to use less expensive ATA drives, while maintaining a high level of performance, is an important feature of the invention. It will be recognized, however, that many of the architectural features of the invention can be used to increase the performance of disk array systems that use other types of drives, including SCSI drives. It will also be recognized that the disclosed array controller 70 can be adapted for use with other types of disk drives (including CD-ROM and DVD drives) and mass storage devices (including FLASH and other solid state memory drives).

[0039] In the preferred embodiment, the array of ATA drives 72 is operated as a RAID array using, for example, a RAID 4 or a RAID 5 configuration. The array controller 70 can alternatively be configured through firmware to operate the drives using a non-RAID implementation, such as a JBOD (Just a Bunch of Disks) configuration.

[0040] With further reference to FIG. 2, the array controller 70 includes an automated array coprocessor 80, a microcontroller 82, and an array of automated controllers 84 (one per ATA drive 72), all of which are interconnected by a local control bus 86 that is used to transfer command and other control information. (As used herein, the term “automated” refers to a data processing unit which operates without fetching and executing sequences of macro-instructions.) The automated controllers 84 are also connected to the array coprocessor 80 by a packet-switched bus 90. As further depicted in FIG. 2, the array coprocessor 80 is locally connected to a buffer 94, and the microcontroller 82 is locally connected to a read-only memory (ROM) 96 and a random-access memory (RAM) 98.

[0041] The packet-switched bus 90 handles all I/O data transfers between the automated controllers 84 and the array coprocessor 80. All transfers on the packet-switched bus 90 flow either to or from the array coprocessor 80, and all accesses to the packet-switched bus are controlled by the array coprocessor. These aspects of the bus architecture provide for a high degree of data flow performance without the complexity typically associated with PCI and other peer-to-peer type bus architectures.

[0042] As described below, the packet-switched bus 90 uses a packet-based round robin protocol that guarantees that at least 1/N of the bus's I/O bandwidth will be available to each drive during each round robin cycle (and thus throughout the course of each I/O transfer). Because this amount (1/N) of bandwidth equals or exceeds the sustained data transfer rate of each ATA drive 72 (which is typically in the range of 10 Mbytes/sec.), all N drives can operate concurrently at the sustained data rate without the formation of a bottleneck. For example, in an 8-drive configuration, all 8 drives can continuously stream 10 Mbytes/second of data to their respective automated controllers 84, in which case the packet-switched bus 90 will transfer the I/O data to the array coprocessor at a rate of 80 Mbytes/second. When fewer than N drives are using the packet-switched bus 90, each drive is allocated more than 1/N of the bus's bandwidth, allowing each drive to transfer data at a rate which exceeds the sustained data transfer rate (such as when the requested I/O data resides in the disk drive's cache).
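To make the 1/N guarantee concrete, the following back-of-the-envelope sketch checks the per-drive share of the bus against the sustained ATA rate. The 132 Mbytes/sec bus figure (33 MHz x 4 bytes per clock) and the 10 Mbytes/sec sustained rate are taken from elsewhere in this description; the code is purely illustrative and not part of the design.

    #include <stdio.h>

    /* Illustrative check of the round robin bandwidth guarantee. */
    int main(void) {
        const double bus_rate_mb  = 33.0 * 4.0; /* 132 Mbytes/sec packet bus */
        const double sustained_mb = 10.0;       /* per-drive sustained rate  */
        const int n_drives = 8;

        double guaranteed_share = bus_rate_mb / n_drives; /* the 1/N share */
        printf("guaranteed per-drive share: %.1f Mbytes/sec\n",
               guaranteed_share);
        printf("guarantee %s the sustained rate\n",
               guaranteed_share >= sustained_mb ? "meets or exceeds"
                                                : "falls below");
        return 0;
    }

With eight drives the guaranteed share is 16.5 Mbytes/sec, comfortably above the 10 Mbytes/sec sustained rate, which is why all N drives can stream concurrently without a bottleneck.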

[0043] In the preferred embodiment, the array coprocessor 80 is implemented using an FPGA, such as a Xilinx 4000-series FPGA. An application-specific integrated circuit (ASIC) or other type of device may alternatively be used. The general functions performed by the array coprocessor 80 include the following: (i) forwarding I/O requests from the host processor 38 to the microcontroller 82, (ii) controlling arbitration on the packet-switched bus 90, (iii) routing I/O data between the automated controllers 84, the system memory 40, and the buffer 94, (iv) performing exclusive-OR, read-modify-write, and other RAID-related logic operations involving I/O data using the buffer 94, and (v) monitoring and reporting the completion status of I/O requests. With respect to the PCI bus 42 of the host computer 34, the array coprocessor 80 acts as a PCI initiator (a type of PCI bus master) which initiates memory read and write operations based on commands received from the automated controllers 84. The operation of the array coprocessor 80 is further described below.

[0044] The buffer 94 is preferably either a 1 megabyte (MB) or 4 MB volatile, random access memory. Synchronous DRAM or synchronous SRAM may be used for this purpose. All data that is written from the host computer 34 to the disk array is initially written to this buffer 94. In addition, the array coprocessor 80 uses this buffer 94 for volume rebuilding (such as when a drive or a drive sector goes bad) and parity generation. Although the buffer 94 is external to the array coprocessor in the preferred embodiment, it may alternatively be integrated into the same chip.

[0045] The microcontroller 82 used in the preferred embodiment is a Siemens 163. The microcontroller 82 is controlled by a firmware control program (stored in the ROM 96) that implements a particular RAID or non-RAID storage protocol. The primary function performed by the microcontroller is to translate I/O requests from the host computer 34 into sequences of disk-specific controller commands, and to dispatch these commands over the local control bus 86 to specific automated controllers 84 for processing. As described below, the architecture is such that the microcontroller 82 does not have to directly monitor the I/O transfers that result from the dispatched controller commands, as this task is allocated to the array coprocessor 80 (using an efficient completion token scheme which is described below). This aspect of the architecture enables a relatively low cost, low performance microcontroller to be used, and reduces the complexity of the control program.

[0046] Although the microcontroller 82 is a separate device in the preferred embodiment, the microcontroller could alternatively be integrated into the same device as the array coprocessor 80. This could be done, for example, by purchasing a Siemens 163 core (or the core of a comparable microcontroller), and embedding the core within an ASIC that includes the array coprocessor logic.

[0047] The control program also includes code for initiating volume rebuilds in response to drive failures, and for handling other types of error conditions. The particular settings (RAID configuration, rebuild options, etc.) implemented by the control program are stored within a profile table (not shown) in the local RAM 98, and can be modified by a system administrator using a utility program that runs on the host computer 34.

[0048] The automated controllers 84 are implemented in the preferred embodiment using Xilinx FPGA devices, with two automated controllers implemented within each FPGA chip. ASICs could alternatively be used. The automated controllers 84 operate generally by communicating with their respective drives 72 based on commands (referred to herein as “controller commands”) received from the microcontroller 82, and by communicating with the array coprocessor 80 over the packet-switched bus to transfer I/O data. As discussed below, the automated controllers 84 implement a command buffer to avoid the latency normally associated with having to request and wait for the next disk command.

[0049] As further depicted by FIG. 2, the system includes a device driver 100 which is executed by the host processor 38 to enable the operating system to communicate with the array controller 70. In the preferred embodiment, the device driver 100 is implemented as a SCSI Miniport driver that runs under the Microsoft Windows 95 or NT operating system. The driver 100 presents the drive array to the host computer 34 as a SCSI device, which in turn enables the array controller 70 to queue up and process multiple I/O requests at a time. A kernel mode disk device driver may alternatively be used, in which case the I/O requests passed to the device driver by the operating system will be in the form of Windows I/O request packets (IRPs). As shown in FIG. 2, the device driver maintains and accesses an I/O request status table 102 in system memory. As described below, the array coprocessor 80 updates this table 102 (in response to special completion packets received from the automated controllers 84) to notify the driver 100 of the completion of pending I/O requests.

[0050] FIG. 3 illustrates the general flow of information between the components of the disk-array system during a typical I/O operation, and will be used to describe the general operation of the system (including a technique for monitoring the completion status of pending I/O requests). To simplify the drawing, the disk drives 72 and buffer 94 are omitted from the figure, and the automated controllers 84 are shown as a single entity. Throughout the description which follows, it is assumed that the number of drives N is 8. In addition, the operation of the system is described as if only a single I/O request is being processed, although multiple I/O requests will typically be processed concurrently.

[0051] In operation, when the device driver 100 receives an I/O request from the operating system (not shown), the device driver assigns to the I/O request an identification number referred to as a completion token (“token”). In the preferred embodiment, the tokens are 4-bit values that are recycled (reused) as I/O requests are completed. As depicted in FIG. 3, the device driver 100 passes the I/O request (in the general form of a CDB plus a scatter-gather list) and the token to the microcontroller 82 for processing. In addition, the device driver 100 records the token in the I/O request status table 102 to maintain a record of the pending I/O request. This may be accomplished, for example, by setting appropriate status flags associated with the token value.

[0052] Because the array controller 70 can process multiple I/O requests at a time, multiple I/O requests may be recorded within the status table 102 at any given time. As described below, the array coprocessor 80 automatically updates the status table 102 whenever an I/O request is completed, and the device driver 100 monitors the status table 102 to detect the completion of the pending I/O requests. In the preferred embodiment, the I/O requests may be completed by the array controller 70 in an order that is different from the order in which the I/O requests are passed to the array controller 70.
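The token scheme lends itself to a simple pool-plus-table arrangement on the driver side. The sketch below is a minimal illustration, assuming a 16-entry table indexed by the 4-bit token; the names (status_table, alloc_token, and so on) are hypothetical and not taken from the patent.

    #include <stdint.h>
    #include <stdbool.h>

    #define NUM_TOKENS 16  /* 4-bit tokens: up to 16 outstanding requests */

    /* One entry of the I/O request status table 102, as the driver might
     * keep it in system memory. Field names are illustrative only. */
    struct status_entry {
        volatile uint8_t complete;  /* set by the array coprocessor */
        bool             pending;   /* set by the driver on issue   */
    };

    static struct status_entry status_table[NUM_TOKENS];

    /* Allocate a free token, or return -1 if all 16 are outstanding. */
    int alloc_token(void) {
        for (int t = 0; t < NUM_TOKENS; t++) {
            if (!status_table[t].pending) {
                status_table[t].pending  = true;
                status_table[t].complete = 0;
                return t;  /* this token accompanies the I/O request */
            }
        }
        return -1;  /* caller must hold the request until a token frees */
    }

    /* Recycle a token once its completion flag has been consumed. */
    void free_token(int t) {
        status_table[t].pending = false;
    }

Because completions can arrive out of order, the table is indexed by token rather than by submission order, which is all the driver needs to match a completion back to its request.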

[0053] As further illustrated by FIG. 3, the microcontroller 82 records the I/O request and the token within a “pending I/O request” table 106 within its local RAM 98. In addition, the microcontroller 82 translates the I/O request into one or more drive-specific sequences of commands, referred to herein as “controller commands.” For example, if, based on the particular RAID configuration (e.g., RAID 5) implemented by the control program, the I/O request calls for data to be read from or written to drives 1, 2 and 8, the microcontroller will generate three sequences of controller commands, one for each of the three drives. The number of controller commands per drive-specific sequence will generally depend upon the CDB, the RAID configuration, and the number of entries within the scatter-gather list.

[0054] The microcontroller 82 stores these sequences of controller commands in drive-specific queues 108 within the RAM 98, and dispatches the controller commands in sequential order (over the local control bus 86) to the corresponding automated controllers 84. For example, if the I/O request invokes drives 1, 2 and 8, controller command sequences will be written to the respective queues 108 for drives 1, 2 and 8, and the individual controller commands will thereafter be sequentially dispatched from these queues to automated controllers 1, 2 and 8, respectively. A queue 108 may contain controller commands associated with different I/O requests at the same time. A sketch of this per-drive dispatch loop appears below.
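As a rough software illustration of the dispatch this paragraph describes, the following sketch drains one drive-specific queue whenever that drive's automated controller signals readiness. All names (cmd_queue, rdy_asserted, send_over_control_bus) are hypothetical stand-ins; in the real design the dispatch is driven by the microcontroller's PEC channels rather than by polling.

    #include <stdbool.h>

    #define N_DRIVES    8
    #define QUEUE_DEPTH 32

    struct controller_cmd {  /* command block + target address + transfer info */
        unsigned char bytes[16];  /* illustrative fixed size */
    };

    /* One FIFO queue 108 per drive, kept in the microcontroller's RAM 98. */
    struct cmd_queue {
        struct controller_cmd cmds[QUEUE_DEPTH];
        int head, tail;
    };

    static struct cmd_queue queues[N_DRIVES];

    /* Hypothetical hardware hooks. */
    extern bool rdy_asserted(int drive);  /* state of RDY line 130 */
    extern void send_over_control_bus(int drive,
                                      const struct controller_cmd *c);

    /* Dispatch the next queued controller command to each controller
     * whose command buffer is empty (its RDY line is asserted). */
    void dispatch_ready_commands(void) {
        for (int d = 0; d < N_DRIVES; d++) {
            struct cmd_queue *q = &queues[d];
            if (rdy_asserted(d) && q->head != q->tail) {
                send_over_control_bus(d, &q->cmds[q->head]);
                q->head = (q->head + 1) % QUEUE_DEPTH;
            }
        }
    }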

[0055] As described below, a special completion monitoring circuit monitors the processing of the command sequences by the automated controllers 84 that are invoked by the I/O request, and notifies the microcontroller 82 when all of the invoked automated controllers 84 have finished processing their respective command sequences. This eliminates the need for the microcontroller 82 to monitor the processing of the individual command sequences.

[0056] As depicted in FIG. 4, each controller command includes a command block, a target address, and transfer information. The command block specifies a disk operation, such as a read of a particular sector. The target address references a contiguous area in either the system memory 40 or the buffer 94 (FIG. 2) for performing an I/O transfer. The transfer information specifies the details of the transfer operation, such as whether the operation will involve an exclusive-OR of data stored in the buffer 94 (FIG. 2).

[0057] As depicted by the dashed line portion in FIG. 4, the last controller command of each sequence additionally includes the token value that was assigned to the I/O request, a disk-specific completion value (“disk completion value”), and the system memory address of the status table 102 (FIG. 3). These data items may alternatively be transferred to the automated controller as a separate controller command. The disk completion values are generated by the microcontroller 82 such that, when all of the disk completion values assigned to the I/O request are ORed together, the result is a preselected “final completion value” (FFH in the preferred embodiment) that is known to the array coprocessor 80. For example, if drives 1, 2 and 8 are invoked, then the following disk completion values can be used to produce a final value of FFH:

[0058] Drive 1: 01H (00000001B)

[0059] Drive 2: 02H (00000010B)

[0060] Drive 8: FCH (11111100B)

[0061] As described below, the automated controllers 84 transmit the token and their respective completion values to the array coprocessor 80 as the automated controllers 84 finish their respective portions of the I/O request (i.e., finish processing their respective controller command sequences), and the array coprocessor cumulatively ORs the disk completion values together as they are received to detect the completion of the I/O request. This method enables the array coprocessor 80 to efficiently identify the completion of an I/O request without prior knowledge of the processing details (number of disk drives involved, identities of invoked disk drives, etc.) of the I/O request.
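One way to realize this scheme in firmware (a sketch consistent with the drive 1/2/8 example above, not the patent's actual code) is to give each invoked drive but the last a distinct bit, and give the last invoked drive whatever bits remain, so that the values always OR to FFH:

    #include <stdint.h>

    #define FINAL_COMPLETION_VALUE 0xFFu

    /* Assign disk completion values to the invoked drives so that the OR
     * of all values equals FFH. Every invoked drive except the last gets
     * a single distinct bit; the last gets all remaining bits. With
     * drives 1, 2 and 8 invoked this reproduces the 01H/02H/FCH example. */
    void assign_completion_values(const int *invoked, int n, uint8_t *values) {
        uint8_t used = 0;
        for (int i = 0; i < n - 1; i++) {
            values[i] = (uint8_t)(1u << i);
            used |= values[i];
        }
        values[n - 1] = (uint8_t)(FINAL_COMPLETION_VALUE & ~used);
        (void)invoked;  /* drive numbers listed only for symmetry with text */
    }

    /* Array coprocessor side: OR in each arriving disk completion value
     * and report whether the I/O request for this token is finished. */
    int note_completion(uint8_t *accumulator, uint8_t disk_value) {
        *accumulator |= disk_value;
        return *accumulator == FINAL_COMPLETION_VALUE;
    }

Note that the single-drive case falls out naturally: the lone drive receives FFH itself, so one completion packet suffices.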

[0062] With further reference to FIG. 3, the automated controllers 84 process the controller commands by communicating with their respective disk drives 72 (FIG. 2), and by sending packets to the array coprocessor 80 over the packet-switched bus 90. In the example above (drives 1, 2 and 8 invoked), the I/O request would thus result in packets flowing from automated controllers 1, 2 and 8 to the array coprocessor 80. Each controller command spawns the transmission of a sequence of packets (e.g., 16 packets) from the corresponding automated controller 84. (As used herein, the term “packet” refers generally to a block of binary data that includes address and control information.)

[0063] As illustrated in FIG. 5, each packet includes a transfer command, a target address, and an optional payload (depending upon the type of the packet and the availability of I/O data). The transfer command specifies an operation to be performed by the array coprocessor 80. For example, a packet might include a READ PCI transfer command that instructs the array coprocessor 80 to copy a block of data from a specified system memory address to a specified buffer 94 address. For all packets other than completion packets (discussed below), the transfer command is derived by the automated controller 84 from the transfer information (FIG. 4) included within the controller command. The target address specifies a target location, in either the buffer 94 (FIG. 2) or the system memory 40 (FIG. 2), to which data is to be transferred or from which data is to be read.

[0064] The transfer commands that are supported by the system are listed and summarized in Table 1. As illustrated by Table 1, if the transfer command is WRITE BUFFER, XOR BUFFER or WRITE PCI, the payload includes disk data that has been read from the corresponding disk drive. In the example flow shown in FIG. 3, the I/O data is depicted as flowing from the array coprocessor 80 to system memory 40, as would be the case when a WRITE PCI command is executed.

[0065] If, on the other hand, the transfer command is READ BUFFER, the automated controller 84 transmits the command and the target address, and the array coprocessor 80 then “fills in” the payload portion with the buffer data to be transferred to the disk drive. Thus, although all packets logically flow from the automated controllers 84 to the array coprocessor 80, the packet-switched bus 90 is actually a bi-directional bus that transfers I/O data in both directions (i.e., from the automated controllers 84 to the array coprocessor 80 and vice versa). The timing associated with packet transfers is discussed separately below.

TABLE 1

READ BUFFER
Target address: buffer address.
Description: Read data from buffer and transfer to automated controller. Payload = 8 Dwords of buffer data.

WRITE BUFFER
Target address: buffer address.
Description: Write disk data to buffer. Payload = 8 Dwords of data read from disk.

XOR BUFFER
Target address: buffer address.
Description: Exclusive-OR buffer data with payload data and overwrite in buffer. Payload = 8 Dwords of data read from disk.

WRITE PCI
Target address: PCI address.
Description: Write payload data to system memory. Payload = 8 Dwords of data read from disk.

READ PCI
Target address: buffer address.
Description: Read data from system memory and write to buffer. Payload = PCI address for performing read.

WRITE PCI COMPLETE
Target address: PCI address of status table.
Description: Update internally-stored completion table using token and disk completion value included within command field. If I/O request is complete, send token to microcontroller, and use PCI address and token to update status table. No payload.

[0066] As shown in Table 1, packets that carry I/O data have a payload length of 8 doublewords (Dwords), where one doubleword = 32 bits. Thus, 16 packets are needed to move one sector (512 bytes) of I/O data.
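A packet of FIG. 5 can be pictured as the following C structure. This is an illustrative layout only; the patent does not specify field widths beyond the 32-bit bus and the 8-Dword payload, and the command encodings here are hypothetical. The arithmetic matches the text: 8 Dwords x 4 bytes = 32 bytes per packet, and 512 / 32 = 16 packets per sector.

    #include <stdint.h>

    /* Transfer commands of Table 1. Encodings are hypothetical. */
    enum transfer_cmd {
        READ_BUFFER,
        WRITE_BUFFER,
        XOR_BUFFER,
        WRITE_PCI,
        READ_PCI,
        WRITE_PCI_COMPLETE
    };

    #define PAYLOAD_DWORDS 8  /* 8 Dwords x 4 bytes = 32 bytes of I/O data */

    /* Illustrative in-memory picture of a packet (FIG. 5): a command
     * field, a target address, and an optional 8-Dword payload. On the
     * wire the command and address travel in the first clock cycles of
     * the granted timeslot. */
    struct packet {
        enum transfer_cmd cmd;  /* also carries token and disk completion
                                   value for WRITE PCI COMPLETE packets */
        uint32_t target_addr;   /* buffer address or PCI address */
        uint32_t payload[PAYLOAD_DWORDS];  /* absent in completion packets */
    };

    /* One 512-byte sector / 32 bytes per packet = 16 packets, as stated. */
    enum { PACKETS_PER_SECTOR = 512 / (PAYLOAD_DWORDS * 4) };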

[0067] In general, the drives invoked by an I/O request process their respective portions (transfers) of the request asynchronously to one another, and can finish their respective portions in any order. In addition, once a drive/automated controller pair finishes processing the I/O request, the pair can immediately begin processing the next I/O request, even though other drives may still be working on the current I/O request.

[0068] Whenever an automated controller 84 finishes processing the last controller command of a sequence of controller commands—indicating that the automated controller has finished its respective portion of the I/O request—the automated controller generates a special packet (referred to as a “completion packet”) which includes the WRITE PCI COMPLETE command (Table 1). An I/O request can produce as few as one completion packet (if only one drive is invoked) and as many as eight completion packets (if all eight drives are invoked), and the completion packets can arrive at the array coprocessor 80 in any order. Each completion packet includes the token, the disk completion value, and the status table (PCI) address that are appended to the last controller command (FIG. 4) of the sequence. The token and disk completion value are included within the packet's command field, and the status table address is included within the address field.

[0069] As the completion packets associated with the I/O request (token) are received, the array coprocessor 80 cumulatively ORs the completion values together to determine whether any other disk drives are still working on the I/O request. The logic circuit used to perform this task is shown in FIG. 8 and is discussed separately below. With the exception of the last completion packet of an I/O request, the array coprocessor 80 does not take any external action in response to receiving the completion packets.

[0070] As further illustrated by FIG. 3, once the result of the cumulative OR operation equals the final completion value (indicating that the last completion packet has been received, and that all drives have finished processing the I/O request), the array coprocessor 80 performs two basic tasks. The first task is to interrupt the microcontroller 82 and transmit the token (over the local control bus 86) to the microcontroller 82. The microcontroller 82 responds to the interrupt by removing the I/O request from the “pending I/O request” table 106 to reflect that the request has been completed. In general, if a pending I/O request is not removed from the table 106 within a certain timeout period, the microcontroller 82 invokes an error processing routine to process the timeout error.

[0071] The second task performed by the array coprocessor 80 is to update a status entry in the status table 102 to indicate to the device driver 100 that processing of the I/O request is complete, and then to set an interrupt flag (if not already set) to the host processor 38 to generate an interrupt request. The update to the status table 102 may be made, for example, by using the PCI address (included within the completion packet) as a base address which points to the status table, and using the token value as an offset into the table. As depicted in FIG. 3, a completion flag associated with the token (I/O request) may then be set. Because only the last completion packet produces an update to the status table 102, the status table address may alternatively be omitted from all but one of the completion packets for the I/O request, in which case the array coprocessor 80 may be configured to buffer the address (in association with the corresponding token) until it is needed.

[0072] In another embodiment of the invention, the completion packets include a payload that carries a pointer that is meaningful to the device driver 100, and the array coprocessor 80 writes this pointer to the status table 102 when the last completion packet is received. The pointer is preferably a value which identifies the I/O request to the device driver 100 or the operating system. For example, the pointer may be an identifier or system memory address of a SCSI request block (SRB) or an I/O request packet (IRP). The advantage of this alternative implementation is that it eliminates the need for the device driver 100 to use a separate lookup table to match the token number to the pending I/O request. As with the tokens, the pointer values are preferably passed to the microcontroller 82 by the device driver 100 (with the I/O requests) and embedded within the last controller command of each drive-specific sequence. The pointer values may also serve as the tokens themselves, in which case separate token values may be omitted.

[0073] While the interrupt request to the host processor 38 is pending, the array controller 70 continues to process pending I/O requests, and continues to update the status table 102 as additional I/O requests are completed. When the host processor 38 eventually processes the interrupt request, the device driver 100 accesses the status table 102 to determine which of the pending I/O requests have been completed. When the device driver 100 determines that a given I/O request has been completed, the device driver notifies the operating system of such, and removes the I/O request from the status table 102. This feature of the architecture (i.e., the ability to process multiple I/O requests per interrupt) significantly improves the performance of the host computer 34 by reducing the frequency at which the host processor 38 is interrupted. To take advantage of this feature, the device driver 100 is preferably configured to make use of deferred procedure calls to defer the processing of the interrupts.
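The "many completions per interrupt" behavior can be sketched as a deferred routine that sweeps the whole status table rather than handling a single request. This continues the driver sketch shown earlier (status_table, NUM_TOKENS, free_token); notify_os_complete is a hypothetical stand-in for the driver's completion callback, and a real driver would run this from the operating system's DPC mechanism.

    /* Deferred (DPC-level) sweep of the status table 102. Because the
     * array controller keeps completing requests while the interrupt is
     * pending, one invocation may retire several I/O requests. */
    extern void notify_os_complete(int token);

    void dpc_sweep_status_table(void) {
        for (int t = 0; t < NUM_TOKENS; t++) {
            if (status_table[t].pending && status_table[t].complete) {
                notify_os_complete(t);  /* hand finished request to the OS */
                free_token(t);          /* token becomes reusable */
            }
        }
    }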

[0074] As will be apparent from the foregoing, an important benefit of the present architecture is that the microcontroller 82 does not have to monitor the constituent disk operations of the I/O request to ensure that each completes successfully. A related benefit, which is described further below, is that the array coprocessor 80 does not require logic for correlating the constituent disk operations to the pending I/O requests. Both of these features are enabled in part by the use of tokens and completion values to track the completion of I/O requests.

[0075] Another benefit of the architecture is that the microcontroller 82 is effectively removed from the I/O data path. This reduces the complexity of the control program, and enables a less expensive microcontroller to be used. Another benefit is that the flow of command information to the automated controllers 84 does not interfere with the flow of I/O data, since separate busses are used for the two.

[0076] It will be appreciated that the above-described method for monitoring the completion of I/O requests can also be used in a disk array system in which each disk controller 84 controls multiple disk drives. Each disk controller 84 that is invoked by the I/O request would still be assigned a unique disk completion value, but this value would be passed to the array coprocessor 80 only after all of the invoked disk drives controlled by that controller have finished processing the I/O request. It will also be recognized that the I/O requests that are tracked using the above-described technique need not correspond identically to the I/O requests generated by the operating system. For example, the device driver could be configured to combine multiple I/O requests together for processing, and the above-described method could be used to detect the completion of these combined I/O requests.

[0077] III. Local Bus Signals of Array Controller

[0078] The primary interconnections between the components of the array controller 70 will now be described with reference to FIG. 6, which shows the array coprocessor 80, the microcontroller 82, and a single automated controller 84. Throughout FIG. 6, the abbreviation “AC” is used to refer to the automated controllers, and subscripts are used to denote correspondence with drives 1-8.

[0079] As illustrated by FIG. 6, the signal lines that interconnect the array coprocessor 80 to the automated controllers 84 to form the packet-switched bus 90 (FIG. 2) include a bus clock (BUSCLK) signal line 120, a 32-bit packet bus 90A, and a series of drive-specific request (REQ) and grant (GNT) lines 124, 126. The bus clock line 120 connects to all of the automated controllers 84, and carries a clock signal that controls all packet transfers on the packet-switched bus. In the preferred embodiment, the bus clock is a 33 MHz signal, and transfers of packet data occur at a rate of 32 bits (one doubleword) per clock cycle. In other embodiments, a faster bus clock speed may be used to accommodate faster and/or greater numbers of disk drives.

[0080] The 32-bit packet bus 90A carries all packet data that is transferred over the packet-switched bus. All packet transfers on this 32-bit bus 90A occur between the array coprocessor 80 and one of the automated controllers 84, with address and control information flowing in one direction (from the automated controllers 84 to the array coprocessor 80) and with I/O data flowing in both directions.

[0081] Each automated controller 84 is connected to the array coprocessor 80 by a respective request line 124 (labeled REQ₁-REQ₈ in FIG. 6) and a respective grant line 126 (labeled GNT₁-GNT₈). These signal lines carry signals that are used to implement the round robin arbitration protocol. More specifically, the request lines 124 are used by the respective automated controllers 84 to request timeslots on the packet-switched bus 90, and the grant lines 126 are used to grant the bus to the individual automated controllers 84. The grant lines 126 are also used by the array coprocessor 80 to control the framing of packets on the packet-switched bus. A preferred implementation of the arbitration protocol is discussed separately below with reference to FIG. 7.

[0082] As further illustrated by FIG. 6, each automated controller 84 connects to the microcontroller 82 by a respective ready signal line 130 (labeled RDY₁-RDY₈). Each ready line 130 carries a ready signal that is used by the respective automated controller 84 to request new controller commands from the microcontroller 82. As described below, the automated controllers 84 double buffer controller commands, so that the next controller command (if available) will be queued up within the automated controller 84 when the current controller command is completed. As depicted in FIG. 6, each ready signal line 130 connects to a respective PEC (peripheral event controller) input of the Siemens 163 microcontroller 82. The use of PECs provides a mechanism for rapidly and efficiently dispatching the controller commands from the command queues 108 (FIG. 3) to the automated controllers 84.

[0083] The remaining signal lines (data, etc.) of the local control bus are collectively denoted by reference number 86A in FIG. 6.

[0084] IV. Architecture and General Operation of Array Coprocessor

[0085] With further reference to FIG. 6, the array coprocessor 80 includes a buffer control circuit 134, an automated packet processor 136, a PCI interface (I/F) 138, a microcontroller interface 140, and an arbitration state machine 142. The buffer control circuit 134 includes logic for writing to and reading from the buffer 94 (FIG. 2). The buffer control circuit 134 also includes parity generation logic and logic for performing exclusive-OR operations on I/O data.

[0086] The automated packet processor 136 includes logic for parsing and processing packets received from the automated controllers 84, including routing logic for routing I/O data between the automated controllers on the one hand, and the buffer 94 and system memory 40 (FIG. 2) on the other. The packets are processed by the automated packet processor 136 according to the transfer commands set forth in Table 1 above. A FIFO memory (not shown) is included within the automated packet processor 136 to temporarily buffer the I/O data that is being transferred.

[0087] In general, each packet received by the automated packet processor 136 is a self-contained entity which fully specifies an operation (including any target address) to be performed by the array coprocessor 80. For example, when a packet containing a WRITE PCI transfer command is received, the array coprocessor simply writes the payload data to the target PCI address specified within the packet, without regard to either the source (disk drive) of the payload data or the I/O request to which the data corresponds. In this respect, the array coprocessor 80 acts essentially as a stateless server—executing transfer commands from the automated controllers 84 (the “clients”) without the need to know the details of the underlying I/O requests. An important benefit of this feature is that the logic circuitry of the array coprocessor 80 is significantly less complex than would be possible if, for example, the array coprocessor had to “match up” each incoming packet to its corresponding I/O request.
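The "stateless server" behavior amounts to a per-packet dispatch on the command field, with no lookup of the originating request. A minimal software model follows, reusing the packet struct sketched earlier; the buffer_ and pci_ helpers are hypothetical stand-ins for the buffer control circuit 134 and the PCI interface 138.

    #include <stdint.h>

    /* Hypothetical datapath hooks. */
    extern void buffer_write(uint32_t addr, const uint32_t *data, int n);
    extern void buffer_xor(uint32_t addr, const uint32_t *data, int n);
    extern void pci_write(uint32_t addr, const uint32_t *data, int n);

    /* Software model of the packet processor's dispatch: every packet is
     * handled purely from its own fields. */
    void process_packet(struct packet *p) {
        switch (p->cmd) {
        case WRITE_BUFFER:
            buffer_write(p->target_addr, p->payload, PAYLOAD_DWORDS);
            break;
        case XOR_BUFFER:  /* RAID parity accumulation in the buffer 94 */
            buffer_xor(p->target_addr, p->payload, PAYLOAD_DWORDS);
            break;
        case WRITE_PCI:   /* no knowledge of which I/O request this serves */
            pci_write(p->target_addr, p->payload, PAYLOAD_DWORDS);
            break;
        default:
            /* READ BUFFER, READ PCI and WRITE PCI COMPLETE involve the
             * grant/payload handshakes and completion logic described
             * elsewhere in this description. */
            break;
        }
    }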

[0088] The automated packet processor 136 also includes a completion logic circuit 144 for processing completion packets to detect the end of an I/O request. As illustrated in FIG. 6, the completion logic circuit 144 generates an internal interrupt (INT) signal 148 to the PCI and microcontroller interfaces 138, 140 when the last completion packet of an I/O request is received. Assertion of this interrupt signal causes the microcontroller interface 140 to interrupt the microcontroller 82, and causes the PCI interface to set the interrupt flag (not shown) to the host processor 38. The completion logic circuit 144 is described in further detail below under the heading MONITORING OF I/O REQUEST COMPLETION.

[0089] The PCI interface 138 includes the basic logic needed to act as a PCI initiator on the host PCI bus 42. Whenever the automated packet processor 136 receives a packet that includes data to be written to system memory 40, the PCI interface 138 asserts a PCI request line (not shown) to request control of the host PCI bus to complete the transfer.

[0090] As shown in FIG. 6, the PCI interface also includes a mailbox storage area 150 (“mailbox”) which can be written to by the host processor 38 (FIG. 2). In operation, the device driver 100 writes I/O requests and tokens to the mailbox 150 to initiate I/O processing. As depicted by the path 152 from the mailbox 150 to the microcontroller interface 140, I/O requests written to the mailbox are passed to the microcontroller 82 for processing.

[0091] The microcontroller interface 140 includes circuitry for communicating with the microcontroller 82. The circuitry included in this interface 140 is generally dictated by the particular microcontroller that is used, which, in the preferred embodiment, is the Siemens 163. As depicted in FIG. 6, the microcontroller interface 140 drives an interrupt signal to the microcontroller 82 to enable the array coprocessor 80 to interrupt the microcontroller.

[0092] The arbitration state machine 142 implements the control side of the round robin arbitration protocol, and controls all accesses to the packet-switched bus. In a preferred embodiment, the arbitration state machine 142 samples the request (REQ) lines 124 in a round robin fashion (i.e., in sequential order), and whenever a request line is sampled as active, grants the packet-switched bus to the corresponding automated controller 84 (by asserting the corresponding grant line) for a time period sufficient for the transfer of a single packet. The arbitration protocol is described in detail below under the heading ARBITRATION PROTOCOL AND TIMING FOR PACKET TRANSFERS.

[0093] V. Architecture and General Operation of Automated Controllers

[0094] With further reference to FIG. 6, each automated controller 84 includes a read FIFO 170, a write FIFO 172, and a transfer/command control circuit 176. The signal lines which connect the automated controller to its corresponding ATA drive include a 16-line data bus 178 and a set of ATA control lines 179, all of which form part of a standard ATA cable. Each of the units 170, 172, 176 is connected to an internal 16-bit data bus 182 for communicating with an ATA drive, and an internal 32-bit bus 184 for communicating with the array coprocessor 80. As illustrated in FIG. 6, the transfer/command control circuit 176 includes a command buffer 180 for storing controller commands that have been received from the microcontroller 82.

[0095] The read FIFO 170 is used to temporarily store I/O data that is being transferred from the disk drive 72 to the array coprocessor 80. As depicted in FIG. 6, data is written into the read FIFO 170 one word (16 bits) at a time, and is read out onto the data bus 90A one doubleword at a time. In the preferred embodiment, the read FIFO 170 holds 16 doublewords of data, which is the equivalent of two packet payloads.

[0096] In operation, data is written into the read FIFO at the disk drive's burst rate, which is 16.6 Mbytes/second for ATA mode 4 (EIDE) drives and 33.3 Mbytes/second for Ultra ATA drives. (The sustained transfer rates for these drives are typically significantly less because of seek times.) Data is read from the read FIFO 170 (during allocated timeslots) and output onto the data bus 90A at the 33 MHz × 4 bytes/cycle = 132 Mbytes/sec transfer rate of the packet-switched bus. The read FIFO thus acts as a data accelerator, storing I/O data from the disk drive at one speed, and transmitting the data onto the data bus 90A in time-compressed bursts at a much faster data rate.

[0097] The write FIFO 172 is used to temporarily store I/O data that is being transferred from the array coprocessor 80 to the disk drive 72. As depicted in FIG. 6, data is written into the write FIFO 172 one doubleword at a time (at the 132 Mbytes/sec transfer rate of the packet-switched bus), and is transferred to the disk drive one word at a time (at the disk drive's burst rate). The write FIFO thus acts as a data decelerator, accepting I/O data in relatively high-transfer-rate bursts, and transferring the I/O data to the disk drive over longer time intervals at a relatively slow transfer rate. As with the read FIFO 170, the write FIFO holds 16 doublewords (two packets) of I/O data.

[0098] The transfer/command control circuit 176 includes logic for performing the following tasks: (i) pre-fetching controller commands from the microcontroller 82 into the command buffer 180, so that the command buffer contains the next controller command (if available) when processing of the current controller command is completed; (ii) processing controller commands received from the microcontroller 82 to generate transfer commands to pass to the disk drive 72; (iii) implementing the “host” side of the ATA protocol to communicate with the ATA drive 72; (iv) generating the headers (address and command fields) of packets to be transmitted on the packet-switched bus 90, and gating the header data onto the data bus 90A; (v) controlling the flow of data into and out of the read and write FIFOs 170 and 172; and (vi) generating request (REQ) signals and monitoring grant (GNT) signals to implement the “client” side of the arbitration protocol. The logic circuitry used to implement these functions is discussed below under the heading TRANSFER/COMMAND CONTROL CIRCUIT.

[0099] In operation, the transfer/command control circuit 176 asserts the RDY line 130 to the microcontroller 82 whenever the command buffer 180 is empty. Assertion of the RDY line 130 causes the microcontroller 82 to issue the next controller command to the automated controller 84 from the corresponding queue 108 (FIG. 3). If no controller command is currently in the queue, the microcontroller issues the controller command when it becomes available (such as when a new I/O request is received from the host computer 34). When the microcontroller 82 issues a controller command to the automated controller 84, the transfer/command control circuit 176 stores the command block portion (FIG. 4) of the controller command in the command buffer 180 and deasserts the RDY line 130.

[0100] When the ATA drive becomes ready, the transfer/command control circuit 176 writes the command block to the drive for processing. The command block includes the various parameters (cylinder, head, etc.) which specify a disk transfer operation (“disk operation”). If the controller command calls for a write of I/O data to the disk, the transfer/command control circuit 176 also generates and transmits appropriate packets (with READ BUFFER and/or READ PCI commands) to begin filling the write FIFO 172 with I/O data. Once the command block is written to the disk drive 72, the command buffer 180 becomes empty, and the transfer/command control circuit 176 reasserts the RDY line 130 to request a new controller command. As discussed below, the target address and other information needed to complete the transfer over the packet-switched bus are maintained in separate registers 280 (FIG. 9).

[0101] In typical ATA implementations, a period of disk inactivity or “dead period” occurs while the ATA drive fetches the next disk command from the host computer. This dead period adversely affects the net throughput of the disk drive. In the preferred embodiment, the architecture of the control program is such that the next controller command (if available) will be written to the command buffer 180 before the disk drive 72 finishes processing the current disk operation. Thus, the latency that would normally be associated with having to fetch a new controller command from the microcontroller 82 is avoided. This feature of the architecture enables a high degree of performance to be achieved using low-cost ATA drives.
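The prefetch behavior described in the last two paragraphs can be modeled as a small two-state loop: the RDY line is the "buffer empty" signal, and the command block is handed to the drive as soon as the drive is ready, immediately freeing the buffer for the next prefetch. The following is a software caricature of that hardware sequencing, with hypothetical helper names throughout.

    #include <stdbool.h>

    /* Hypothetical hooks modeling the hardware described above. */
    extern bool drive_ready(void);                      /* ATA status poll  */
    extern void ata_write_command_block(const void *);  /* issue disk op    */
    extern void assert_rdy(void);                       /* raise RDY line 130 */

    /* Command buffer 180: holds the prefetched command block, if any. */
    static unsigned char command_buffer[16];
    static bool buffer_full = false;

    /* Called when the microcontroller dispatches a controller command;
     * the dispatch itself deasserted the RDY line. */
    void on_controller_command(const unsigned char *cmd_block, int len) {
        for (int i = 0; i < len && i < 16; i++)
            command_buffer[i] = cmd_block[i];
        buffer_full = true;
    }

    /* As soon as the drive can accept a command, hand it the buffered
     * command block and re-assert RDY so the next command is prefetched
     * while the drive works. This hides the "dead period" an ATA drive
     * would otherwise spend waiting for its next command. */
    void service_drive(void) {
        if (buffer_full && drive_ready()) {
            ata_write_command_block(command_buffer);
            buffer_full = false;
            assert_rdy();  /* prefetch the next controller command now */
        }
    }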

[0102] During the processing of the disk operation, the transfer/command control circuit 176 repeatedly asserts its request (REQ) line 124 to the array coprocessor 80 to request timeslots on the packet-switched bus 90. For example, if the disk operation is a sector read, the transfer/command control circuit 176 will assert the request line 124 sixteen times to transfer sixteen packets, each containing eight doublewords of I/O data. As the sequence of packets is transferred, the transfer/command control circuit 176 increments an internal counter (not shown) to reflect the number of bytes that have been transferred, and uses the counter value to generate appropriate target addresses to insert within the headers (FIG. 5) of the packets.

[0103] The transfer/command control circuit 176 determines whether to assert the request line 124 either by monitoring the state of the read FIFO 170 (if the disk operation is a disk read) or by monitoring the state of the write FIFO 172 (if the disk operation is a disk write). Specifically, for disk read operations, the transfer/command control circuit 176 asserts the request line 124 whenever the read FIFO 170 contains at least one packet (8 doublewords) of I/O data; and for disk write operations, the transfer/command control circuit 176 asserts the request line 124 whenever the write FIFO 172 has sufficient room to receive at least one packet of I/O data. (As indicated above, each of these FIFOs 170, 172 has a capacity that is equivalent to two packets of I/O data.) Thus, request signals are generated based on the availability of these two buffers.
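Expressed as combinational logic, the request rule of this paragraph reduces to a pair of occupancy comparisons against the one-packet (8-Dword) threshold. A sketch, with hypothetical fifo_count inputs standing in for the FIFO occupancy counters:

    #include <stdbool.h>

    #define FIFO_CAPACITY_DWORDS 16  /* two packet payloads */
    #define PACKET_DWORDS         8  /* one packet payload  */

    /* Request-line rule of paragraph [0103]: on reads, request a
     * timeslot once a full packet of disk data is waiting in the read
     * FIFO; on writes, request once the write FIFO has room for a full
     * packet. Occupancy arguments are in Dwords. */
    bool should_assert_req(bool is_disk_read,
                           int read_fifo_count, int write_fifo_count) {
        if (is_disk_read)
            return read_fifo_count >= PACKET_DWORDS;
        return (FIFO_CAPACITY_DWORDS - write_fifo_count) >= PACKET_DWORDS;
    }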

[0104] Whenever the automated controller 84 asserts its request line 124, the automated controller will be granted a timeslot in which to perform a packet transfer within a fixed, maximum time period. (This feature of the bus design is a result of the round robin protocol, which is discussed below.) This maximum time period is approximately equal to the time needed for all seven of the other automated controllers 84 to transmit maximum-length packets. This maximum time period is preferably selected such that (i) on disk read operations, the read FIFO 170 will never become completely full, and (ii) on disk write operations of data stored in the buffer 94, the write FIFO 172 will never prematurely become empty. An important benefit of this feature is that the disk drive will not be required to suspend a disk read or disk write operation as the result of insufficient bandwidth on the packet-switched bus. Thus, the packet-switched bus provides a virtual connection between the array coprocessor 80 and every automated controller 84.
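
The bound can be expressed numerically. Assuming, for illustration only, a two-cycle header (transfer command plus target address) and an eight-doubleword payload per packet, a worst-case wait might be computed as follows; the cycle counts are assumptions consistent with the description, not figures taken from the specification:

    #define NUM_CONTROLLERS   8u
    /* header (2 cycles) + payload (8 doublewords at 1 per cycle) */
    #define MAX_PACKET_CYCLES (2u + 8u)

    /* Upper bound on bus cycles a requester waits for its timeslot:
     * the seven other controllers each send one maximum-length packet. */
    unsigned max_wait_cycles(void) {
        return (NUM_CONTROLLERS - 1u) * MAX_PACKET_CYCLES;  /* 70 cycles */
    }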

[0105] VI. Arbitration Protocol and Timing for Packet Transfers

[0106] As illustrated in FIG. 6 and discussed above, the array coprocessor 80 includes an arbitration state machine 142 that grants control of the data bus 90A to the automated controllers 84 using a round robin protocol. The arbitration state machine grants control of the bus 90A based on the respective states of the request lines 124 from the automated controllers 84, and based on transfer status information received from the automated packet processor 136. The automated controllers 84 assert their respective request lines 124 asynchronously to one another, and multiple request lines can be asserted during the same cycle of the bus clock.

[0107] FIG. 7 is a flow diagram which illustrates the basic arbitration protocol implemented by the arbitration state machine 142. The variable “N” in the flow diagram is a disk drive reference number which varies between 1 and 8. As illustrated by blocks 200-206 of the diagram, when none of the eight request (REQ) lines are active, the state machine 142 remains in a loop in which it samples the request lines in sequence. In one implementation, the state machine 142 uses one clock cycle of the bus clock 120 to sample an inactive request line 124 and move on to the next request line. Thus, when none of the request lines 124 are active, the state machine 142 samples all eight request lines in eight clock cycles. In other implementations, the state machine 142 may be configured to sample multiple request lines 124 per clock cycle.
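
The sampling loop of blocks 200-206 might be modeled as follows; this is a behavioral sketch only, and req_active() and grant_timeslot() are hypothetical stand-ins for the hardware:

    extern int  req_active(int n);      /* state of REQ line 124 for drive n */
    extern void grant_timeslot(int n);  /* assert grant line 126 for drive n */

    /* One request line is sampled per bus clock, so an idle bus is
     * scanned in eight cycles (the one-line-per-cycle implementation). */
    void arbitration_loop(void) {
        int n = 1;                      /* drive reference number, 1..8 */
        for (;;) {
            if (req_active(n))
                grant_timeslot(n);      /* blocks 210-220 of FIG. 7 */
            n = (n % 8) + 1;            /* advance; wrap from 8 back to 1 */
        }
    }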

[0108] As illustrated by blocks 202 and 210, when a request line 124 is sampled as active, the state machine 142 immediately (i.e., on the same clock cycle) asserts the corresponding grant line 126 to grant the bus to the requesting automated controller 84. On the same clock cycle, the array coprocessor 80 receives the transfer command (FIG. 5) from the automated controller 84; and on the following clock cycle, the array coprocessor 80 receives the target address from the automated controller 84.

[0109] As depicted by blocks 212 and 218, the state machine 142 then communicates with the automated packet processor 136 (FIG. 6) to determine whether or not the packet will include a payload. No payload is transmitted either if (i) the transfer command is WRITE PCI COMPLETE (block 212), or (ii) the transfer command is READ BUFFER and the target data is not yet available in the buffer 94 (block 216). In either of these two cases, the state machine 142 deasserts the grant line 126 (block 216) to terminate the timeslot, and returns to the sampling loop.

[0110] As represented by block 220, if neither of the above conditions is met, the state machine 142 continues to assert the grant line 126 while the payload is transmitted or received. As discussed above, the payload is transferred over the data bus 90A (FIG. 6) at a rate of one doubleword per clock cycle. If the payload is transferred from the array coprocessor 80 to an automated controller 84, an extra clock cycle is used as a “dead period” between the header transmission by the automated controller 84 and the payload transmission by the array coprocessor 80.
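
The payload decision of blocks 212-220 reduces to a small predicate. The enumeration below lists only the transfer commands named in the text, and the function name is an assumption:

    enum transfer_command { READ_BUFFER, READ_PCI, WRITE_PCI_COMPLETE };

    /* Returns nonzero when the granted timeslot carries a payload. */
    int packet_has_payload(enum transfer_command cmd, int data_in_buffer_94) {
        if (cmd == WRITE_PCI_COMPLETE)            /* block 212: no payload */
            return 0;
        if (cmd == READ_BUFFER && !data_in_buffer_94)
            return 0;                             /* block 216: data not ready */
        return 1;                                 /* block 220: transfer payload */
    }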

[0111] An important aspect of this arbitration protocol is that when a disk drive does not use its timeslot, the timeslot is effectively relinquished for other drives to use. Thus, in addition to guaranteeing that 1/N of the bus's total bandwidth will be available to every drive at all times (i.e., during every round robin cycle), the protocol enables the drives to use more than 1/N of the total bandwidth when one or more drives are idle. A drive may be able to use this additional bandwidth, for example, if a cache hit occurs on a disk read, allowing the drive to return the requested data at a rate which is considerably higher than the drive's sustained transfer rate.

[0112] Although the system of the preferred embodiment uses drive-specific request and grant lines 124, 126 to implement the round robin protocol, a variety of alternative techniques are possible. For example, the array coprocessor 80 could transmit periodic synchronization pulses on a shared control line to synchronize the automated controllers 84, and each automated controller could be preprogrammed via the control program to use a different timeslot of a frame; the automated controllers could then use internal counters to determine when their respective timeslots begin and end.

[0113] It will also be recognized that although the preferred embodiment uses a round robin arbitration protocol, other protocols can be used to achieve a similar effect. For example, the arbitration state machine could be designed to implement a protocol in which the bus is granted to the automated controller 84 that least-recently accessed the packet-switched bus 90.

[0114] VII. Monitoring of I/O Request Completion

[0115] FIG. 8 illustrates the completion logic circuit 144 of the array coprocessor 80, and illustrates the general flow of information that takes place whenever a completion packet is received. As described above, the purpose of the circuit 144 is to monitor the tokens and disk completion values contained within completion packets to detect the completion of processing of an I/O request. When the circuit 144 detects that an I/O request has been completed, the circuit asserts the internal interrupt line 148, which causes the array coprocessor 80 to interrupt the microcontroller 82 and set the interrupt flag to the host processor 38.

[0116] As depicted in FIG. 8, the circuit 144 includes a register file 240, an 8-bit logical OR circuit 242, and an 8-bit compare circuit 244. The register file 240 includes sixteen 8-bit registers 248 (labeled 0-F). Each register 248 corresponds to a respective 4-bit token and holds the result of the cumulative OR operation for the corresponding I/O request. As described above, the tokens are assigned to pending I/O requests by the device driver as the I/O requests are passed to the array controller 70. At any given time, each assigned token corresponds uniquely to a different pending I/O request. Thus, in the implementation depicted in FIG. 8, up to sixteen I/O requests can be pending simultaneously.

[0117] Disk completion values are generated by the control program (such as by using a lookup table), and are assigned such that the cumulative OR of all of the completion values assigned to a given I/O request equals FFH. For example, for an I/O request that only requires access to one drive, a single disk completion value of FFH will be assigned to the disk drive; and for an I/O request that involves all eight disk drives 72, each drive will be assigned a disk completion value having a different respective bit set (i.e., 00000001, 00000010, 00000100, 00001000, 00010000, 00100000, 01000000, and 10000000).
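
One way to generate such an assignment is sketched below, under the assumption (not stated in the text) that drives are numbered from zero and that low-order bits are handed out first:

    #include <stdint.h>

    /* Fills out[0..ndrives-1] (ndrives in 1..8) with completion values
     * whose cumulative OR is FFH. One drive receives FFH; eight drives
     * receive the eight single-bit values listed above. */
    void assign_completion_values(int ndrives, uint8_t out[8]) {
        uint8_t used = 0;
        for (int i = 0; i < ndrives; i++) {
            out[i] = (uint8_t)(1u << i);    /* one distinct bit per drive */
            used |= out[i];
        }
        out[ndrives - 1] |= (uint8_t)~used; /* fold unused bits into last value */
    }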

[0118] In operation, whenever a completion packet is received, the token and the disk completion value are extracted from the packet and passed as inputs to the completion logic circuit 144. As depicted in FIG. 8, the token is used to address the register file 240, causing the corresponding cumulative OR value (which will be 0 on the first pass) to be read from the register file and fed as an input to the OR circuit 242. The cumulative OR value is then ORed with the disk completion value to generate a new completion value. The new completion value is written back to the same location 248 in the register file 240, and is also compared by the compare circuit 244 with the final completion value of FFH. If a match occurs (indicating that the last completion packet has been received), the compare circuit 244 asserts the INT line 148, and also asserts a reset signal (not shown) which causes the addressed location in the register file 240 to be reset.
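
A software model of this datapath is straightforward; the register widths follow the text, while the function itself is an illustrative stand-in for the hardware:

    #include <stdint.h>

    static uint8_t register_file[16];   /* sixteen 8-bit registers, 0-F */

    /* Processes one completion packet; returns nonzero when the final
     * completion value FFH is reached (i.e., when the INT line 148
     * would be asserted) and resets the addressed register. */
    int on_completion_packet(uint8_t token, uint8_t disk_completion) {
        uint8_t idx = token & 0x0F;                 /* 4-bit token */
        uint8_t acc = register_file[idx] | disk_completion;
        if (acc == 0xFF) {
            register_file[idx] = 0;                 /* reset for reuse */
            return 1;
        }
        register_file[idx] = acc;                   /* write back OR result */
        return 0;
    }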

[0119] As indicated above, an important benefit of this method is that it enables the array coprocessor 80 to detect the completion of an I/O request without any prior information about the I/O request (such as the number of drives involved or the type of transfer). Another benefit is that it enables the completion of the I/O request to be rapidly posted to the host computer 34, regardless of the order in which the disk drives finish processing their component portions of the I/O request.

[0120] VIII. Transfer/Command Control Circuit

[0121] FIG. 9 illustrates the transfer/command control circuit 176 of FIG. 6 in greater detail, and illustrates the primary signal connections of the transfer/command control circuit 176 to other components of the system. To simplify the drawing, the read and write FIFOs 170, 172 are shown as a single entity, and the logic for generating request (REQ) signals and monitoring grant (GNT) signals has been omitted.

[0122] As illustrated in FIG. 9, the transfer/command control circuit 176 includes a transfer engine 260 and a command engine 262 that are connected by a START line 264, a DONE line 268, and a transfer command bus 272. The transfer and command engines 260, 262 include state machines and other logic which collectively implement the “host” side of the ATA protocol (including Ultra ATA). In typical ATA implementations, the host side of the ATA protocol is implemented through firmware. By automating the host side of the protocol (i.e., implementing the host side purely within hardware), a high degree of performance is achieved without the need for complex firmware.

[0123] The transfer engine 260 interfaces with the ATA drive 72 via a set of standard ATA signal lines, including chip selects 179A, strobes 179B, and an I/O ready line 179C. The transfer engine 260 also includes a set of FIFO control lines 276 that are used to control the flow of data into and out of the read and write FIFOs 170, 172.

[0124] The command engine 262 connects to the microcontroller 82 via the ready (RDY) line 130 and the local control bus 86A, and connects to the array coprocessor 80 via the 32-bit data path 90A of the packet-switched bus. The command engine 262 connects to the ATA drive 72 via the 16-bit ATA data bus 178 and the ATA drive's interrupt request (IRQ) line 179D. Included within the command engine 262 are the command buffer 180 and a set of registers 280. As discussed below, the registers 280 are used to hold information (target addresses, etc.) associated with the controller commands.

[0125] The transfer engine 260 supports three types of disk transfer operations: a 1-cycle STATUS READ, an 8-cycle COMMAND WRITE, and a 256-cycle DATA TRANSFER. These operations are initiated by the command engine by asserting the START signal line 264 and driving the transfer command bus 272 with a command code. When a STATUS READ is performed, the transfer engine 260 reads the ATA drive's status register (not shown), and routes the status information to the command engine 262. When a COMMAND WRITE is performed, the transfer engine 260 gates the contents of the command buffer 180 onto the drive's data bus 178 to copy a command block (FIG. 4) to the drive. When a DATA TRANSFER is performed, the transfer engine 260 transfers one sector of I/O data between the drive and either the read FIFO 170 or the write FIFO 172.
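
These three operations suggest a simple dispatch interface between the two engines. The sketch below is hypothetical: the cycle counts come from the text (note that 256 cycles on the 16-bit data bus 178 moves exactly one 512-byte sector), while the names and the function shape are assumed:

    enum transfer_op {
        STATUS_READ,    /* 1 bus cycle */
        COMMAND_WRITE,  /* 8 bus cycles */
        DATA_TRANSFER   /* 256 cycles: one 512-byte sector at 16 bits/cycle */
    };

    extern void drive_command_bus(enum transfer_op op); /* command bus 272 */
    extern void assert_start(void);                     /* START line 264 */
    extern void wait_done(void);                        /* DONE line 268 */

    void start_transfer(enum transfer_op op) {
        drive_command_bus(op);  /* place the command code on bus 272 */
        assert_start();         /* command engine kicks the transfer engine */
        wait_done();            /* transfer engine signals completion */
    }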

[0126] With further reference to FIG. 9, the transfer/command control circuit 176 processes controller commands generally as follows. Whenever the command buffer 180 is empty, the command engine 262 asserts the RDY line 130 to request a new controller command from the microcontroller 82. When the microcontroller 82 returns a controller command, the command engine 262 deasserts the RDY line 130 and parses the controller command. The command block (FIG. 4) is written to the command buffer 180, and the remaining portions of the controller command (target address, transfer information, and any completion information) are written to the registers 280.

[0127] At this point, the command engine 262 waits until processing of any ongoing disk operation is complete. Once processing is complete, the command engine implements the sequence shown in FIG. 10 (discussed below) to control the operation of the disk drive 72. In addition, if the controller command calls for data to be written to the disk drive 72 and the write FIFO 172 is available, the command engine 262 begins to generate and send packets on the packet-switched bus to initiate the filling of the write FIFO 172.

[0128] FIG. 10 illustrates the sequence of transfer operations that are initiated by the command engine 262. The command engine initially requests a STATUS READ operation to check the status of the drive. If the result of the STATUS READ indicates that firmware intervention will be required (not shown in FIG. 10), the command engine 262 reports the error to the microcontroller 82, and the microcontroller enters into an appropriate service routine. If no errors are reported, the command engine 262 initiates a COMMAND WRITE operation to transfer the command block from the command buffer 180 to the ATA drive 72. This causes the command buffer 180 to become empty, which in turn causes the command engine 262 to reassert the RDY line 130. The command block may specify a transfer of zero sectors, one sector, or multiple sectors.

[0129] After the drive 72 returns from the COMMAND WRITE operation (by asserting the IRQ line 179D), the command engine 262 either (i) initiates a new STATUS READ operation (if no data transfer is required) to begin processing of the next controller command, or (ii) initiates a 256-cycle DATA TRANSFER operation to transfer one sector of data between the disk drive and one of the FIFOs 170, 172. When a DATA TRANSFER operation is completed, the command engine 262 either returns to the STATUS READ state, or, if additional sector transfers are needed, initiates one or more additional DATA TRANSFER operations.
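
Paragraphs [0128] and [0129] together describe a loop that can be rendered in C for clarity. This builds on the hypothetical start_transfer() and transfer_op sketch above; the error-handling names are likewise assumed:

    extern int  drive_reported_error(void);         /* from the STATUS READ */
    extern void report_error_to_microcontroller(void);
    extern void reassert_rdy(void);                 /* RDY line 130 */
    extern void wait_drive_irq(void);               /* IRQ line 179D */

    void run_command_sequence(unsigned sectors) {
        start_transfer(STATUS_READ);            /* check drive status */
        if (drive_reported_error()) {
            report_error_to_microcontroller();  /* firmware intervention */
            return;
        }
        start_transfer(COMMAND_WRITE);          /* copy command block to drive */
        reassert_rdy();                         /* command buffer now empty */
        wait_drive_irq();                       /* drive accepts the command */
        while (sectors--)                       /* zero, one, or many sectors */
            start_transfer(DATA_TRANSFER);      /* one sector per operation */
        /* control returns to STATUS READ for the next controller command */
    }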

[0130] One benefit to using automated ATA controllers (as opposed to firmware) is that on read operations, the data can be retrieved from the drive as soon as it is available. In addition to reducing latency, this aspect of the design enables ATA drives with smaller buffers to be used without the usual loss in performance.

[0131] Although this invention has been described in terms of certain preferred embodiments, other embodiments that are apparent to those of ordinary skill in the art are also within the scope of this invention. Accordingly, the scope of the present invention is intended to be defined only by reference to the appended claims.

[0132] In the claims which follow, reference characters used to designate claim steps are provided for convenience of description only, and are not intended to imply any particular order for performing the steps.

What is claimed is:
1. A disk array system, comprising: a plurality of controllers, each of which controls a respective ATA (Advanced Technology Attachment) disk drive, said controllers being external to the respective ATA disk drives; and a microcontroller that dispatches disk drive commands to specific controllers of said plurality of controllers over a bus; wherein each controller stores disk drive commands received from the microcontroller in a respective command buffer, and issues the disk drive commands therefrom to its respective ATA disk drive for execution; and wherein each controller is configured to prefetch a next disk drive command from the microcontroller while a current disk drive command is being executed by the respective ATA disk drive, such that the next disk drive command will be available within the respective command buffer to issue to the respective ATA disk drive upon completion of the current disk drive command.
2. The disk array system of claim 1, wherein each controller is implemented within automated circuitry.
3. The disk array system of claim 1, wherein each controller implements a host side of an ATA protocol in automated circuitry to control a respective ATA disk drive.
4. The disk array system of claim 1, wherein the microcontroller dispatches at least some of the disk drive commands to the controllers together with target system memory addresses for transferring input/output data between the ATA disk drives and a system memory.
5. The disk array system of claim 1, wherein each controller notifies the microcontroller that it is ready to receive a new command when the command buffer of that controller becomes empty.
6. The disk array system of claim 1, wherein the controllers operate without fetching and executing sequences of macro-instructions.
7. The disk array system of claim 1, wherein each controller stores disk drive commands in its respective command buffer, and issues such commands therefrom to its respective ATA disk drive, without fetching and executing sequences of macro-instructions.
8. The disk array system of claim 1, wherein the microcontroller dispatches the disk drive commands to the controllers over a control bus which is separate from a bus used to transfer I/O data.
9. The disk array system of claim 1, wherein the microcontroller is programmed to implement at least one RAID configuration.
10. The disk array system of claim 1, wherein the microcontroller maintains command queues for each of the plurality of controllers, and dispatches disk drive commands from the command queues to corresponding controllers.
11. The disk array system of claim 1, wherein each controller is coupled to a second bus used to transfer input/output data to and from the ATA disk drives, and each controller includes a bus arbitration circuit that requests time slots on said second bus according to a bus arbitration protocol.
12. A disk array system, comprising: a plurality of ATA (Advanced Technology Attachment) disk drives; and a plurality of controllers, each of which controls a respective ATA disk drive of the plurality of ATA disk drives; wherein each controller includes a command buffer that stores disk drive commands, and issues said disk drive commands therefrom for execution by the respective ATA disk drive; and wherein each controller further includes a respective automated circuit that prefetches disk drive commands into its respective command buffer, so that a next disk drive command is available in the respective command buffer to issue therefrom when the respective ATA disk drive finishes executing a current disk drive command.
13. The disk array system of claim 12, wherein each controller stores disk drive commands in its respective command buffer, and issues such commands therefrom to its respective ATA disk drive, without fetching and executing any macro-instructions.
14. The disk array system of claim 12, wherein each controller is implemented within automated circuitry.
15. The disk array system of claim 12, wherein each controller implements a host side of an ATA protocol in automated circuitry to control a respective ATA disk drive.
16. The disk array system of claim 12, further comprising a microcontroller that dispatches the disk drive commands to each of the plurality of controllers in response to signals generated by the controllers.
17. The disk array system of claim 16, wherein the microcontroller dispatches the disk drive commands to the controllers together with associated transfer information for performing transfers of input/output data.
18. The disk array system of claim 16, wherein the microcontroller maintains a separate command queue for each of the controllers, and dispatches disk drive commands from the command queues to the respective controllers.
19. The disk array system of claim 16, wherein the microcontroller dispatches the disk drive commands to the controllers over a control bus that is separate from a bus used for transfers of input/output data.
20. The disk array system of claim 12, wherein each controller is coupled to a bus used to transfer input/output data, and each controller includes a bus arbitration circuit that requests time slots on said bus according to a bus arbitration protocol.
21. A method of controlling an ATA disk drive of a disk array system, the method comprising: dispatching a first disk drive command to a controller that controls the ATA disk drive, the first disk drive command specifying a first disk operation, wherein the controller is external to the ATA disk drive; storing the first disk drive command in a command buffer of the controller; issuing the first disk drive command from the command buffer of the controller to the ATA disk drive for execution; during execution of the first disk drive command by the ATA disk drive, dispatching a second disk drive command to the controller, and storing the second disk drive command within the command buffer of the controller, wherein the second disk drive command specifies a second disk operation; and when the ATA disk drive finishes executing the first disk drive command, issuing the second disk drive command from the command buffer to the ATA disk drive for execution.
22. The method of claim 21, wherein the steps of issuing the first and second disk drive commands from the command buffer to the ATA disk drive are performed entirely by automated circuitry of the controller.
23. The method of claim 21, wherein the steps of issuing the first and second disk drive commands from the command buffer to the ATA disk drive are performed without the controller fetching and executing any macro-instructions.
24. The method of claim 21, wherein the steps of dispatching the first and second disk drive commands to the controller are performed by a microcontroller.
25. The method of claim 21, wherein the first and second disk drive commands are dispatched to the controller together with corresponding address and transfer information for performing input/output operations.
26. The method of claim 21, wherein the second disk drive command is dispatched to the controller in response to a ready signal generated by the controller.
27. The method of claim 21, further comprising automating a host side of an ATA protocol within application-specific circuitry of the controller to communicate with the ATA disk drive.